Optimal Computational Trade-Off of Inexact Proximal Methods 



Pierre Machart
LIF, LSIS, CNRS, Aix-Marseille University
pierre.machart@lif.univ-mrs.fr

Sandrine Anthoine
LATP, CNRS, Aix-Marseille University
anthoine@cmi.univ-mrs.fr

Luca Baldassarre
LIONS, Ecole Polytechnique Federale de Lausanne
luca.baldassarre@epfl.ch

October 23, 2012 



Abstract 

In this paper, we investigate the trade-off between convergence rate and computational cost when mini- 
mizing a composite functional with proximal-gradient methods, which are popular optimisation tools in 
machine learning. We consider the case when the proximity operator is computed via an iterative proce- 
dure, which provides an approximation of the exact proximity operator. In that case, we obtain algorithms 
with two nested loops. We show that the strategy that minimizes the computational cost to reach a so- 
lution with a desired accuracy in finite time is to set the number of inner iterations to a constant, which 
differs from the strategy indicated by a convergence rate analysis. In the process, we also present a new 
procedure called SIP (that is Speedy Inexact Proximal-gradient algorithm) that is both computationally 
efficient and easy to implement. Our numerical experiments confirm the theoretical findings and suggest 
that SIP can be a very competitive alternative to the standard procedure. 



1 Introduction 

Recent advances in machine learning and signal processing have led to more involved optimisation prob- 
lems, while abundance of data calls for more efficient optimization algorithms. First-order methods are now 
extensively employed to tackle these issues and, among them, proximal-gradient algorithms p~6| . f25 | 16] are 
becoming increasingly popular. They make it possible to solve very general convex non-smooth problems 
of the following form: 

$$\min_{x} f(x) := g(x) + h(x), \qquad (1)$$

where $g : \mathbb{R}^n \to \mathbb{R}$ is convex and smooth with an $L$-Lipschitz continuous gradient and $h : \mathbb{R}^n \to \mathbb{R}$ is
lower semi-continuous proper convex, with remarkably simple, while effective, iterative algorithms which
are guaranteed [6] to achieve the optimal convergence rate of $O(1/k^2)$, for a first order method, in the
sense of [23] . They have been applied to a wide range of problems, from supervised learning with sparsity- 
inducing norm [T3] H] [23J , imaging problems [101 03 [T7] , matrix completion [ST] , sparse coding [TI5] 
and multi-task learning [T3] . 

The heart of these procedures is the proximity operator. In the favorable cases, analytical forms exist. 
However, there are many problems, such as Total Variation (TV) denoising and deblurring [llj . non-linear 
variable selection [33] , structured sparsity [THl H] , trace norm minimisation [HI [H] , matrix factorisation 
problems such as the one described in |28j , where the proximity operator can only be computed numerically, 
giving rise to what can be referred to as inexact proximal- gradient algorithms [551 EO] • 

Both theory and experiments show that the precision of those numerical approximations has a fundamental
impact on the performance of the algorithm. A simple simulation comparing different strategies for
setting this precision on a classical Total Variation image deblurring problem (see Section 5 for more
details) highlights two aspects of this impact. Fig. 1 depicts the evolution of the objective value (hence
precision) versus the computational cost (i.e. running time, see Section 3 for a more formal definition).
The different curves are obtained by solving the exact same problem of the form (1), using, along the
optimization process, either a constant precision (for different constant values) for the computation of
the proximity operator, or an increasing precision (that is, computing the proximity operator more and
more precisely along the process). It shows that the computational cost required to reach a fixed value
of the objective varies greatly between the different curves (i.e. strategies). That means that the
computational performance of the procedure will dramatically depend on the chosen strategy. Moreover,
we can see that the curves do not reach the same plateaus, meaning that the different strategies cause the
algorithm to converge to different solutions, with different precisions.

Meanwhile, designing learning procedures which ensure good generalization properties is a central and 
recurring problem in Machine Learning. When dealing with small-scale problems, this issue is mainly 
covered by the traditional trade-off between approximation (i.e. considering the use of predictors that 
are complex enough to handle sophisticated prediction tasks) and estimation (i.e. considering the use 
of predictors that are simple enough not to overfit on the training data). However, when dealing with 
large-scale problems, the amount of available data can make it impossible to precisely solve for the optimal 
trade-off between approximation and estimation. Building on that observation, [7] has highlighted that 
optimization should be taken into account as a third crucial component, in addition to approximation and 
estimation, leading to more complex (multiple) trade-offs. 

In fact, dealing with the aforementioned multiple trade-off in a finite amount of computation time 
urges machine learners to consider solving problems with a lower precision and pay closer attention to the 
computational cost of the optimization procedures. This crucial point motivates the study of strategies 
that lead to an approximate solution, at a smaller computational cost, as the figure depicts. However, the 
choice of the strategy that determines the precision of the numerical approximations seems to be often 
overlooked in practice. Yet, in the light of what we have discussed in this introduction, we think it is
pivotal. In several studies, the precision is set so that the global algorithm converges to an optimum of
the functional [12] , by studying sufficient conditions for such a convergence. In many others, it is only 
considered as a mere implementation detail. A quick review of the literature shows that many application- 
centered papers seem to neglect this aspect and fail at providing any detail regarding this point (e.g. pQ). 

Recently, some papers have addressed this question from a more theoretical point of view. For instance,
[28, 30] give conditions on the approximations of the proximity operator so that the optimal convergence
rate is still guaranteed. However, rate analysis is not concerned with the complexity of computing the
proximity operator. As a consequence, the algorithms yielding the fastest rates of convergence are not
necessarily the computationally lightest ones, hence not the ones yielding the shortest computation time.
In fact, no attempts have yet been made to assess the global computational cost of those inexact
proximal-gradient algorithms. It is worth mentioning that for some specific cases, other types of
proximal-gradient algorithms have been proposed that avoid computing complex proximity operators [22, 11].

In Section 2, we start from the results in [25] that link the overall accuracy of the iterates of inexact
proximal-gradient methods with the errors in the approximations of the proximity operator. We consider
iterative methods for computing the proximity operator and, in Section 3, we show that if one is interested
in minimizing the computational cost (defined in Section 3.4) for achieving a desired accuracy, other
strategies than the ones proposed in [25] and [30] might lead to significant computational savings.

Figure 1: TV-regularization: Computational cost vs. objective value for different strategies

The main contribution of our work is showing, in Section 4, that for both accelerated and
non-accelerated proximal-gradient methods, the strategy minimizing the global cost to achieve a desired
accuracy is to keep the number of internal iterations constant. This constant depends on the desired 
accuracy and the convergence rate of the algorithm used to compute the proximity operator. Coinciden- 
tally, those theoretical strategies meet those of actual implementations and widely-used packages and help 
us understand both their efficiency and limitations. After a discussion on the applicability of those strate- 
gies, we also propose a more practical one, namely the Speedy Inexact Proximal-gradient (SIP) strategy, 
motivated by our analysis. 

In Section 5, we numerically assess different strategies (i.e. constant numbers of inner iterations, SIP,
the strategy yielding optimal convergence rates) on two problems, illustrating the theoretical analysis and 
suggesting that our new strategy SIP can be very effective. This leads to a final discussion about the 
relevance and potential limits of our approach along with some hints on how to overcome them. 



2 Setting 

2.1 Inexact Proximal Methods 

To solve problem (1), one may use the so-called proximal-gradient methods [25]. Those iterative methods
consist in generating a sequence $\{x_k\}$, where

$$x_k = \mathrm{prox}_{h/L}\left[ y_{k-1} - \frac{1}{L}\nabla g(y_{k-1}) \right],$$

$$\text{with } y_k = x_k + \beta_k (x_k - x_{k-1}),$$

and the choice of $\beta_k$ gives rise to two schemes: $\beta_k = 0$ for the basic scheme, or some well-chosen sequence
(see [25, 29, 6] for instance) for an accelerated scheme. The proximity operator $\mathrm{prox}_{h/L}$ is defined as:

$$\mathrm{prox}_{h/L}(z) = \operatorname{argmin}_{x} \ \frac{L}{2}\|x - z\|^2 + h(x). \qquad (2)$$

In the most classical setting, the proximity operator is computed exactly. The sequence $\{x_k\}$ then
converges to the solution of problem (1). However, in many situations no closed-form solution of (2) is
known and one can only provide an approximation of the proximal point. From now on, let us denote by
$\epsilon_k$ an upper bound on the error induced in the proximal objective function by this approximation, at the
$k$-th iteration:

$$\frac{L}{2}\|x_k - z\|^2 + h(x_k) \le \epsilon_k + \min_{x} \left\{ \frac{L}{2}\|x - z\|^2 + h(x) \right\}. \qquad (3)$$

For the basic scheme, the convergence of $\{x_k\}$ to the optimum of Problem (1) has been studied in [If)]
and is ensured under fairly mild conditions on the sequence $\{\epsilon_k\}$.



2.2 Convergence Rates 

The authors of [28] go beyond the study on the convergence of inexact proximal methods: they establish
their rates of convergence. (This is actually done in the more general case where the gradient of g is also
approximated. In the present study, we restrict ourselves to error in the proximal part.)

Let us denote by $x^*$ the solution of problem (1). The convergence rate of the basic (non-accelerated)
proximal method (i.e. $y_k = x_k$) thus reads:

Proposition 1 (Basic proximal-gradient method (Proposition 1 in [28])). For all $k \ge 1$,

$$f(x_k) - f(x^*) \le \frac{L}{2k}\left( \|x_0 - x^*\| + 2\sum_{i=1}^{k}\sqrt{\frac{2\epsilon_i}{L}} + \sqrt{\sum_{i=1}^{k}\frac{2\epsilon_i}{L}} \right)^2. \qquad (4)$$

Remark 1. In [28], this bound actually holds on the average of the iterates $x_i$, i.e. on
$f\left(\frac{1}{k}\sum_{i=1}^{k} x_i\right) - f(x^*)$.
(4) thus holds for the iterate that achieves the lowest function value. It also trivially holds all the time for
algorithms with which the objective is non-increasing.

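The convergence rate of accelerated schemes (i.e. $y_k = x_k + \beta_k (x_k - x_{k-1})$ with a well-chosen sequence $\beta_k$) reads:

Proposition 2 (Accelerated proximal-gradient method (Proposition 2 in [28])). For all $k \ge 1$,

$$f(x_k) - f(x^*) \le \frac{2L}{(k+1)^2}\left( \|x_0 - x^*\| + 2\sum_{i=1}^{k} i\sqrt{\frac{2\epsilon_i}{L}} + \sqrt{\sum_{i=1}^{k}\frac{2 i^2 \epsilon_i}{L}} \right)^2. \qquad (5)$$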



Bounds with faster rates (Propositions 3 and 4 in [28]) can be obtained if the objective is strongly
convex. Some results for this case will be briefly mentioned in Section 4. However, we will not detail them
as much as in the more general setting.



2.3 Approximation Trade-off 

The inexactitude in the computation of the proximity operator imposes two additional terms in each
bound, for instance in (4):

$$2\sum_{i=1}^{k}\sqrt{\frac{2\epsilon_i}{L}} \quad \text{and} \quad \sqrt{\sum_{i=1}^{k}\frac{2\epsilon_i}{L}}.$$

When the $\epsilon_i$'s are set to 0 (i.e. the proximity operator is computed exactly), one obtains the usual
bounds of the exact proximal methods. These additional terms (in (4) and (5) resp.) are summable if
$\{\epsilon_k\}$ converges at least as fast as $O(1/k^{2+\delta})$ (resp. $O(1/k^{4+\delta})$), for any $\delta > 0$. One direct consequence of
these bounds (in the basic and accelerated schemes respectively) is that the optimal convergence rates in
the error-free setting are still achievable, with such conditions on the $\{\epsilon_k\}$'s. Improving the convergence
rate of $\{\epsilon_k\}$ further causes the additional terms to sum to smaller constants, hence inducing a faster
convergence of the algorithm without improving the rate. However, [28] empirically notices that imposing
too fast a decrease rate on $\{\epsilon_k\}$ is computationally counter-productive, as the precision required on the
proximal approximation becomes computationally demanding. In other words, there is a subtle trade-off
between the number of iterations needed to reach a certain solution and the cost of those iterations. This
is the object of study of the present paper.
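
The effect of the decay rate of the errors on the basic bound can be checked numerically. The following is a minimal Python sketch, not part of the original study: it evaluates the right-hand side of the basic bound, in the form of Proposition 1 of [28], for errors $\epsilon_i = i^{-p}$. The function name and the values $L = 1$ and $\|x_0 - x^*\| = 1$ are illustrative assumptions.

```python
import math

def basic_bound(k, eps, L=1.0, d0=1.0):
    """Right-hand side of the basic bound after k outer iterations.

    eps(i) returns the proximal error at outer iteration i (1-indexed);
    d0 stands for ||x_0 - x*||. L and d0 are illustrative values.
    """
    a = sum(math.sqrt(2.0 * eps(i) / L) for i in range(1, k + 1))
    b = sum(2.0 * eps(i) / L for i in range(1, k + 1))
    return L / (2.0 * k) * (d0 + 2.0 * a + math.sqrt(b)) ** 2

def fast(i):
    return i ** -3.0   # decays faster than 1/i^2: the error terms are summable

def slow(i):
    return i ** -1.0   # decays too slowly: the error terms keep growing with k

for k in (10, 100, 1000):
    print(k, basic_bound(k, fast), basic_bound(k, slow))
```

With the fast decay the bound keeps shrinking as $k$ grows, whereas with the slow decay it stalls at a constant level, which is the behaviour described above.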



3 Defining the Problem 

The main contribution of this paper is to define a computationally optimal way of setting the trade-off 
between the number of iterations and their cost, in various situations. We consider the case where the 
proximity operator is approximated via an iterative procedure. The global algorithm thus consists in an 
iterative proximal method, where at each (outer-)iteration, one performs (inner-)iterations. 

With that setting, it is possible to define (hence optimize) the global computational cost of the algo- 
rithm. If the convergence rate of the procedure used in the inner-loops is known, the main result of this 
study provides a strategy to set the number of inner iterations that minimizes the cost of the algorithm, 
under some constraint upper-bounding the precision of the solution (as defined in Q). 

3.1 The Computational Cost of Inexact Proximal Methods 

As stated earlier, our goal is to take into account the complexity of the global cost of inexact proximal 
methods. Using iterative procedures to estimate the proximity operator at each step, it is possible to 
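formally express this cost. Let us assume that each inner-iteration has a constant computational cost
$C_{in}$ and that, in addition to the cost induced by the inner-iterations, each outer-iteration has a constant
computational cost $C_{out}$. It immediately follows that, with $l_i$ inner iterations performed at the $i$-th
outer iteration, the global cost of the algorithm after $k$ outer iterations is:

$$C_{glob}(k, \{l_i\}_{i=1}^{k}) = C_{in}\sum_{i=1}^{k} l_i + k\,C_{out}, \qquad (6)$$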

and the question we are interested in is to minimize this cost. In order to formulate our problem as a 
minimization of this cost, subject to some guarantees on the global precision of the solution, we now need 
to relate the number of inner iterations to the precision of the proximal point estimation. This issue is 
addressed in the following subsections. 
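
As a small illustration of this cost model, the sketch below computes the global cost $C_{in}\sum_i l_i + k\,C_{out}$ for a given list of inner-iteration counts. The function name and the unit costs are assumptions made for the example only.

```python
def global_cost(inner_iters, c_in=1.0, c_out=1.0):
    """Global cost C_in * sum_i l_i + k * C_out, with k = len(inner_iters)."""
    k = len(inner_iters)
    return c_in * sum(inner_iters) + k * c_out

constant = [5] * 100               # 100 outer iterations, 5 inner iterations each
increasing = list(range(1, 101))   # l_i = i, i.e. a growing inner precision
print(global_cost(constant), global_cost(increasing))
```

For the same number of outer iterations, the increasing schedule is far more expensive, which is why the cost, and not only the rate, has to enter the analysis.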

3.2 Parameterizing the Error

Classical methods to approximate the proximity operator achieve either sublinear rates of the form $O(1/k^{\alpha})$
($\alpha = 1/2$ for sub-gradient or stochastic gradient descent in the general case; $\alpha = 1$ for gradient and proximal
descent or $\alpha = 2$ for accelerated descent/proximal schemes) or linear rates $O((1-\gamma)^k)$ (for strongly
convex objectives or second-order methods). Let $l_i$ denote the number of inner iterations performed at
the $i$-th iteration of the outer-loop. We thus consider two types of upper bounds on the error defined in (3):

$$\epsilon_i = \frac{A_i}{l_i^{\alpha}} \ \text{(sublinear rate)} \quad \text{or} \quad \epsilon_i = A_i (1-\gamma)^{l_i} \ \text{(linear rate)}, \qquad (7)$$

for some positive $A_i$'s.
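
The two error models can be inverted to see how many inner iterations a target proximal error requires. The sketch below is illustrative only: it assumes the sublinear form $\epsilon = A / l^{\alpha}$ and the linear form $\epsilon = A(1-\gamma)^{l}$, and the function names and default constants are hypothetical.

```python
import math

def eps_sublinear(l, A=1.0, alpha=1.0):
    """Error after l inner iterations under a sublinear rate A / l**alpha."""
    return A / l ** alpha

def eps_linear(l, A=1.0, gamma=0.5):
    """Error after l inner iterations under a linear rate A * (1 - gamma)**l."""
    return A * (1.0 - gamma) ** l

def inner_iters_sublinear(eps, A=1.0, alpha=1.0):
    """Smallest integer l >= 1 such that A / l**alpha <= eps."""
    return max(1, math.ceil((A / eps) ** (1.0 / alpha)))

def inner_iters_linear(eps, A=1.0, gamma=0.5):
    """Smallest integer l >= 1 such that A * (1 - gamma)**l <= eps."""
    return max(1, math.ceil(math.log(eps / A) / math.log(1.0 - gamma)))

print(inner_iters_sublinear(3e-3), inner_iters_linear(3e-3))
```

The gap between the two counts shows how strongly the rate of the inner solver drives the cost of a given precision.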



3.3 Parameterized Bounds 

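Plugging (7) into (4) or (5), we can get four different global bounds:

$$f(x_k) - f(x^*) \le B_j(k, \{l_i\}_{i=1}^{k}), \quad j = 1, \dots, 4,$$

depending on whether we are using a basic or accelerated scheme on the one hand, and on whether we
have sub-linear or linear convergence rate in the inner-loops on the other hand. More precisely, we have
the following four cases:

1. basic out, sub-linear in:

$$B_1(k, \{l_i\}_{i=1}^{k}) = \frac{L}{2k}\left( \|x_0 - x^*\| + 3\sum_{i=1}^{k}\sqrt{\frac{2A_i}{L\, l_i^{\alpha}}} \right)^2$$

2. basic out, linear in:

$$B_2(k, \{l_i\}_{i=1}^{k}) = \frac{L}{2k}\left( \|x_0 - x^*\| + 3\sum_{i=1}^{k}\sqrt{\frac{2A_i (1-\gamma)^{l_i}}{L}} \right)^2$$

3. accelerated out, sub-linear in:

$$B_3(k, \{l_i\}_{i=1}^{k}) = \frac{2L}{(k+1)^2}\left( \|x_0 - x^*\| + 3\sum_{i=1}^{k} i\sqrt{\frac{2A_i}{L\, l_i^{\alpha}}} \right)^2$$

4. accelerated out, linear in:

$$B_4(k, \{l_i\}_{i=1}^{k}) = \frac{2L}{(k+1)^2}\left( \|x_0 - x^*\| + 3\sum_{i=1}^{k} i\sqrt{\frac{2A_i (1-\gamma)^{l_i}}{L}} \right)^2$$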
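
The four cases can be evaluated with a single helper. The following is a minimal sketch under the assumption that all $A_i$ equal a common constant $A$ and that the bounds take the form "outer factor times (distance to optimum plus three times a weighted sum of $\sqrt{2\epsilon_i/L}$) squared"; the function name and argument names are hypothetical.

```python
import math

def parameterized_bound(inner_iters, L, A, d0, accelerated, alpha=None, gamma=None):
    """Evaluate the bound for inner-iteration counts l_1..l_k with A_i = A.

    Give exactly one of alpha (sub-linear inner rate) or gamma (linear inner
    rate). d0 stands for ||x_0 - x*||.
    """
    if (alpha is None) == (gamma is None):
        raise ValueError("give exactly one of alpha or gamma")
    k = len(inner_iters)
    total = 0.0
    for i, l in enumerate(inner_iters, start=1):
        eps = A / l ** alpha if alpha is not None else A * (1.0 - gamma) ** l
        weight = i if accelerated else 1      # accelerated bounds weight term i by i
        total += weight * math.sqrt(2.0 * eps / L)
    factor = 2.0 * L / (k + 1) ** 2 if accelerated else L / (2.0 * k)
    return factor * (d0 + 3.0 * total) ** 2

few = parameterized_bound([1] * 5, L=1.0, A=1.0, d0=1.0, accelerated=False, alpha=1.0)
many = parameterized_bound([100] * 5, L=1.0, A=1.0, d0=1.0, accelerated=False, alpha=1.0)
print(few, many)
```

Setting `A=0.0` recovers the error-free bounds, which gives a quick sanity check of the outer factors.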

3.4 Towards a Computationally Optimal Tradeoff 

Those bounds highlight the aforementioned trade-off. To achieve some fixed global error
$\rho = f(x_k) - f(x^*)$, there is a natural trade-off that needs to be set by the user, between the number $k$
of outer-iterations and the numbers of inner-iterations $\{l_i\}_{i=1}^{k}$, which can be seen as hyper-parameters
of the global algorithm. As mentioned earlier, and witnessed in [28], the choice of those parameters will
have a crucial impact on the computational efficiency (see equation (6)) of the algorithm.

Our aim to "optimally" set the hyper-parameters ($k$ and $\{l_i\}_{i=1}^{k}$) may be conveyed by the following
optimization problem. For some fixed accuracy $\rho$, we want to minimize the global cost $C_{glob}$ of the
algorithm, under the constraint that our bound on the error $B$ is smaller than $\rho$:

$$\min_{k,\ \{l_i\}_{i=1}^{k}} \ C_{in}\sum_{i=1}^{k} l_i + k\,C_{out} \quad \text{s.t.} \quad B(k, \{l_i\}_{i=1}^{k}) \le \rho. \qquad (8)$$

The rest of this paper rests upon this optimization problem.

4 Results

Problem (8) is an integer optimization problem as the variables of interest are numbers of (inner and
outer) iterations. As such, this is a complex (NP-hard) problem and one cannot find a closed form for the
integer solution, but if we relax our problem into a continuous one:

$$\min_{k \in \mathbb{N},\ \{l_i\}_{i=1}^{k} \in [1,\infty)^k} \ C_{in}\sum_{i=1}^{k} l_i + k\,C_{out} \quad \text{s.t.} \quad B(k, \{l_i\}_{i=1}^{k}) \le \rho, \qquad (9)$$

it actually is possible to find an analytic expression of the optimal $\{l_i\}_{i=1}^{k}$ and to numerically find the
optimal $k$.

4.1 Optimal Strategies 

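The next four propositions describe the solution of the relaxed version (9) of Problem (8) in the four
different scenarios defined in Section 3.3 and for a constant value $A_i = A$.

Scenarios 1 and 2: basic out

Let

$$C(k) = \frac{1}{3}\sqrt{\frac{L}{2A}}\left( \sqrt{\frac{2k\rho}{L}} - \|x_0 - x^*\| \right).$$

Solving the continuous relaxation (9) of problem (8) with the bounds $B_1$ and $B_2$ leads to the following
propositions:

Proposition 3 (Basic out, sub-linear in). If $\rho < 6\sqrt{2LA}\,\|x_0 - x^*\|$, the solution of problem (9) for $B = B_1$
is:

$$\forall i,\ l_i^* = \left( \frac{C(k^*)}{k^*} \right)^{-\frac{2}{\alpha}}, \quad \text{with } k^* = \operatorname{argmin}_{k \in \mathbb{N}} \ k\,C_{in}\left( \frac{C(k)}{k} \right)^{-\frac{2}{\alpha}} + k\,C_{out}. \qquad (10)$$

Proposition 4 (Basic out, linear in). If $\rho < 6\sqrt{2LA(1-\gamma)}\,\|x_0 - x^*\|$, the solution of problem (9) for
$B = B_2$ is:

$$\forall i,\ l_i^* = \frac{2}{\ln(1-\gamma)}\ln\left( \frac{C(k^*)}{k^*} \right), \quad \text{with } k^* = \operatorname{argmin}_{k \in \mathbb{N}} \ \frac{2k\,C_{in}}{\ln(1-\gamma)}\ln\left( \frac{C(k)}{k} \right) + k\,C_{out}. \qquad (11)$$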

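Scenarios 3 and 4: accelerated out

Let

$$D(k) = \frac{1}{3}\sqrt{\frac{L}{2A}}\left( \sqrt{\frac{\rho}{2L}}\,(k+1) - \|x_0 - x^*\| \right).$$

Solving the continuous relaxation (9) of problem (8) with the bound $B_3$ leads to the following proposition:

Proposition 5 (Accelerated out, sub-linear in). If $\rho < \left( \sqrt{12\sqrt{2LA}\,\|x_0 - x^*\|} - 3\sqrt{A} \right)^2$, the solution of
problem (9) for $B = B_3$ is:

$$\forall i,\ l_i^* = \left( \frac{2D(k^*)}{k^*(k^*+1)} \right)^{-\frac{2}{\alpha}}, \quad \text{with } k^* = \operatorname{argmin}_{k \in \mathbb{N}} \ k\,C_{in}\left( \frac{2D(k)}{k(k+1)} \right)^{-\frac{2}{\alpha}} + k\,C_{out}. \qquad (12)$$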

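A similar result holds for the last scenario: $B = B_4$. However in this case, the optimal $l_i$ are equal to
1 up to $n(k^*)$ ($1 \le n(k^*) \le k^*$) and then increase with $i$:

Proposition 6 (Accelerated out, linear in). If $\rho < \left( \sqrt{12\sqrt{2LA(1-\gamma)}\,\|x_0 - x^*\|} - 3\sqrt{A(1-\gamma)} \right)^2$, the solution
of problem (9) for $B = B_4$ is:

$$l_i^* = \begin{cases} 1 & \text{for } 1 \le i \le n(k^*) - 1, \\ \dfrac{2}{\ln(1-\gamma)}\ln\left( \dfrac{D(k^*) - \frac{n(k^*)(n(k^*)-1)}{2}\sqrt{1-\gamma}}{i\,(k^* + 1 - n(k^*))} \right) & \text{for } n(k^*) \le i \le k^*, \end{cases}$$

with

$$k^* = \operatorname{argmin}_{k \in \mathbb{N}} \ k\,C_{out} + C_{in}(n(k) - 1) - \frac{2C_{in}}{\ln(1-\gamma)}\ln\left( \frac{k!}{(n(k)-1)!} \right) - \frac{2C_{in}(k - n(k) + 1)}{\ln(1-\gamma)}\ln\left( \frac{k + 1 - n(k)}{D(k) - \frac{n(k)(n(k)-1)}{2}\sqrt{1-\gamma}} \right),$$

and $n(k)$ is defined as the only integer such that:

$$(n(k) - 1)(2k + 2 - n(k))\sqrt{1-\gamma} \le 2D(k) < n(k)(2k + 1 - n(k))\sqrt{1-\gamma}.$$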
Sketch of proof. (For a complete proof, please see the appendix.) First note that:

$$\min_{k,\ \{l_i\}_{i=1}^{k}} \ C_{in}\sum_{i=1}^{k} l_i + k\,C_{out} = \min_{k} \ \min_{\{l_i\}_{i=1}^{k}} \ C_{in}\sum_{i=1}^{k} l_i + k\,C_{out}.$$

We can solve problem (9) by first solving, for any $k$, the minimization problem over $\{l_i\}_{i=1}^{k}$. This is done
using the standard Karush-Kuhn-Tucker approach [20]. Plugging the analytic expression of those optimal
$\{l_i\}_{i=1}^{k}$ into our functional, we get our problem in $k$. □

Remark 2. Notice that the propositions hold for $\rho$ smaller than a threshold. If not, the analysis and
results are different. Since we focus on a high accuracy, we here develop the results for small values of $\rho$.
We defer the results for the larger values of $\rho$ to Appendix 7.

Remark 3. In none of the scenarios can we provide an analytical expression of $k^*$. However, the expressions
given in the propositions allow us to exactly retrieve the solution. The functions of $k$ to minimize are
monotonically decreasing then increasing. As a consequence, it is possible to numerically find the minimizer
in $\mathbb{R}$, for instance in the first scenario:

$$\hat{k} = \operatorname{argmin}_{k \in \mathbb{R}} \ k\,C_{in}\left( \frac{C(k)}{k} \right)^{-\frac{2}{\alpha}} + k\,C_{out},$$

with an arbitrarily high precision, using for instance a First-Order Method. It follows that the integer
solution $k^*$ is exactly either the floor or the ceiling of $\hat{k}$. Evaluating the objective for the two possible
roundings gives the solution.
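
For the first scenario (basic outer scheme, sub-linear inner rate), the computation of $k^*$ and $l^*$ can be sketched as follows. This is an illustrative sketch, not the authors' implementation: instead of a first-order method on the relaxation followed by rounding, it simply scans the integers, which is enough because the objective decreases and then increases. It assumes $C(k) = \frac{1}{3}\sqrt{L/(2A)}\,(\sqrt{2k\rho/L} - \|x_0 - x^*\|)$ and $l^* = (C(k^*)/k^*)^{-2/\alpha}$; all numerical values and function names are hypothetical.

```python
import math

def C(k, rho, L, A, d0):
    """C(k) = (1/3) sqrt(L / (2A)) * (sqrt(2 k rho / L) - ||x_0 - x*||)."""
    return math.sqrt(L / (2.0 * A)) / 3.0 * (math.sqrt(2.0 * k * rho / L) - d0)

def optimal_constant_strategy(rho, L, A, d0, alpha, c_in=1.0, c_out=1.0, k_max=10_000):
    """Scan k = 1..k_max and return (k*, l*, cost) for the first scenario."""
    best = None
    for k in range(1, k_max + 1):
        c = C(k, rho, L, A, d0)
        if c <= 0.0:
            continue  # even an exact proximity operator cannot reach rho in k steps
        l = (c / k) ** (-2.0 / alpha)        # constant (relaxed) number of inner iterations
        cost = k * c_in * l + k * c_out
        if best is None or cost < best[2]:
            best = (k, l, cost)
    return best

def bound_b1(k, l, L, A, d0, alpha):
    """Basic-out, sub-linear-in bound for a constant number l of inner iterations."""
    return L / (2.0 * k) * (d0 + 3.0 * k * math.sqrt(2.0 * A / (L * l ** alpha))) ** 2

rho, L, A, d0, alpha = 0.1, 1.0, 1.0, 1.0, 1.0   # illustrative values only
k_star, l_star, cost = optimal_constant_strategy(rho, L, A, d0, alpha)
print(k_star, l_star, cost)
print(bound_b1(k_star, l_star, L, A, d0, alpha))  # matches rho up to rounding
```

The last line checks that the constraint is active at the optimum: the strategy spends exactly what is needed to guarantee the accuracy $\rho$.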

Finally, as briefly mentioned, bounds with faster rates can be obtained when the objective is known to 
be strongly convex. In that case, regardless of the use of basic or accelerated schemes and of sub-linear or 
linear rates in the inner loops, the analysis leads to results similar to those reported in Proposition [6] (e.g. 
using 1 inner iteration for the first rounds and an increasing number then). Due to the lack of usability 
and interpretability of these results, we will not report them here. 



4.2 Comments and Interpretation of the Results 
Constant number of inner iterations 

Our theoretical results suggest using a constant number of inner iterations in 3 scenarios. Coincidentally,
many actual efficient implementations of such two nested algorithms, in [1] or in packages like SLEP¹ or
PQN², use these constant number schemes. However, the theoretical grounds for such an implementation
choice were not made explicit. Our results can give some deeper understanding of why and how those practical
implementations perform well. They also help acknowledge that the computational gain comes at the cost
of an intrinsic limitation to the precision of the obtained solution.



An Integer Optimization Problem 

The impact of the continuous relaxation of the problem in $\{l_i\}_{i=1}^{k}$ is subtle. In practice, we need to set
the constant number of inner iterations $l_i$ to an integer number. Setting, $\forall i \in [1, k^*]$, $l_i = \lceil l_i^* \rceil$ ensures
that the final error is smaller than $\rho$. This provides us with an approximate (but feasible) solution to the
integer problem.

One may want to refine this solution by sequentially setting $l_i$ to $\lfloor l_i^* \rfloor$ (hence reducing the computational
cost), starting from $i = 1$, while the constraint is met, i.e. the final error remains smaller than $\rho$. Refer
to Algorithm 1 for an algorithmic description of the procedure.

1 http://www.public.asu.edu/~jye02/Software/SLEP/index.htm
2 http://www.di.ens.fr/~mschmidt/Software/PQN.html

Algorithm 1 A finer grain procedure to obtain an integer solution for the $l_i$'s

Require: $\{l_i^*\}_{i=1}^{k^*}$
  $\forall i \in [1, k^*]$, $l_i \leftarrow \lceil l_i^* \rceil$
  $i \leftarrow 1$
  repeat
    $l_i \leftarrow \lfloor l_i^* \rfloor$
    $i \leftarrow i + 1$
  until $B(k^*, \{l_i\}_{i=1}^{k^*}) > \rho$
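
A possible Python rendering of this rounding procedure is sketched below. It is an illustration under stated assumptions rather than the authors' code: `bound` is any callable returning the value of the bound for a list of inner-iteration counts, and, unlike the pseudocode as printed, the sketch undoes the last flooring once the constraint is violated and stops at the end of the list, so that the returned solution is feasible. The toy bound used in the example is a stand-in chosen only to make the behaviour easy to follow.

```python
import math

def round_inner_iterations(l_star, bound, rho):
    """Ceil every l_i*, then floor them one by one while bound(l) <= rho holds."""
    l = [math.ceil(v) for v in l_star]
    for i, v in enumerate(l_star):
        previous = l[i]
        l[i] = max(1, math.floor(v))
        if bound(l) > rho:
            l[i] = previous   # undo the flooring that broke the constraint
            break
    return l

def toy_bound(l):
    # Hypothetical stand-in for B(k*, {l_i}): decreasing in every l_i.
    return sum(1.0 / v for v in l)

print(round_inner_iterations([2.5, 2.5, 2.5, 2.5], toy_bound, rho=1.6))
```

In this toy case only the first entry can be floored: flooring a second one would push the bound above the target.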



Computationally-Optimal vs. Optimal Convergence Rates Strategies 

The original motivation of this study is to show how, in the inexact proximal methods setting, optimization 
strategies that are the most computationally efficient, given some desired accuracy p, are fundamentally 
different from those that achieve optimal convergence rates. The following discussion motivates why 
minding this gap is of great interest for machine learners while an analysis of the main results of this work 
highlights it. 

When one wants to obtain a solution with an arbitrarily high precision, optimal rate methods are 
of great interest: regardless of the constants in the bounds, there always exists a (very high) precision 
p beyond which methods with optimal rates will be faster than methods with suboptimal convergence 
rates. However, when dealing with real large-scale problems, reaching those levels of precision is not 
computationally realistic. When taking into account budget constraints on the computation time, and 
as suggested by [7] , generalization properties of the learnt function will depend on both statistical and 
computational properties. 

At the levels of precision intrinsically imposed by the budget constraints, taking into account elements
other than the convergence rates becomes crucial for designing efficient procedures, as our study shows.
Other examples of that phenomenon have been witnessed, for instance, when using the Robbins-Monro
algorithm (Stochastic Gradient Descent). It has long been known (see [26] for instance) that the use of a
step-size proportional to the inverse of the number of iterations makes it possible to reach the optimal
convergence rates (namely $1/k$).

On the other hand, using a non-asymptotic analysis [3], one can prove (and observe in practice) that
such a strategy can also lead to catastrophic results when $k$ is small (i.e. possibly a large increase of the
objective value) and undermines the computational efficiency of the whole procedure.

Back to our study, for the first three scenarios (Propositions 3, 4 and 5), the computationally-optimal
strategy imposes a constant number of inner iterations. Given our parameterization, Eq. (7), this also
means that the errors $\epsilon_i$ on the proximal computation remain constant. On the contrary, the optimal
convergence rates can only be achieved for sequences of $\epsilon_i$ decreasing strictly faster than $1/i^2$ for the
basic schemes and $1/i^4$ for the accelerated schemes. Obviously, the optimal convergence rates strategies
also yield a bound on the minimal number of outer iterations needed to reach precision $\rho$ by inverting
the bounds (4) or (5). However, this strategy is provably less efficient (computationally-wise) than the
optimal one we have derived.

In fact, the pivotal difference between "optimal convergence rates" and "computationally optimal" 
strategies lies in the fact that the former ones arise from an asymptotic analysis while the latter arise 
from a finite-time analysis. While the former ensures that the optimization procedure will converge to 
the optimum of the problem (with optimal rates in the worst case), the latter only ensures that after k* 
iterations, the solution found by the algorithm is not further than p from the optimum. 

Do not optimize further 

To highlight this decisive point in our context, let us fix some arbitrary precision $\rho$. Propositions 3 to 5
give us the optimal values $k^*$ and $\{l_i^*\}_{i=1}^{k^*}$ depending on the inner and outer algorithms we use. Now, if
one wanted to further optimize by continuing the same strategy for $k' > k^*$ iterations (i.e. still running
$l^*$ inner iterations), we would have the following bound:

$$B(k', \{l_i^*\}_{i=1}^{k'}) > B(k^*, \{l_i^*\}_{i=1}^{k^*}) = \rho.$$

In other words, if one runs more than $k^*$ iterations of our optimal strategy, with the same $l_i$, we cannot
guarantee that the error still decreases. In a nutshell, our strategy is precisely computationally optimal
because it does not ensure more than what we ask for.

4.3 On the Usability of the Optimal Strategies

Designing computationally efficient algorithms or optimization strategies is motivated by practical con- 
siderations. The strategies we proposed are provably the best to ensure a desired precision. Yet, in a 
setting that covers a very broad range of problems, their usability can be compromised. We point out 
those limitations and propose a solution to overcome them. 

First, these strategies require the desired (absolute) precision to be known. In most situations, it is
actually difficult, if not impossible, to know in advance which precision will ensure that the solution found
has the desired properties (e.g. reaching some specific SNR for image deblurring). More critically, if it
turned out that the user-defined precision was not sufficient, we showed that "optimizing further" with
the same number of inner iterations is not guaranteed to improve the solution. For a sharper precision,
one would technically have to compute the new optimal strategy and run it all over again.

Although it is numerically possible, evaluating the optimal number of iterations $k^*$ still requires solving
an optimization problem. More importantly, the optimal values for the numbers of inner and outer
iterations depend on quantities like $\|x_0 - x^*\|$ which are unknown and very difficult to estimate. Those
remarks undermine the direct use of the presented computationally optimal strategies.

To overcome these problems, we propose a new strategy called Speedy Inexact Proximal-gradient
algorithm (SIP), described in Algorithm 2, which is motivated by our theoretical study and very simple
to implement. In a nutshell, it starts using only one inner iteration. When the outer objective stops
decreasing fast enough, the algorithm increases the number of internal iterations used for computing the
subsequent proximal steps, until the objective starts decreasing fast enough again.



Algorithm 2 Speedy Inexact Proximal-gradient strategy (SIP)

Require: An initial point $x_0$, an update rule $A_{out}$, an iterative algorithm $A_{in}$ for computing the proximity
operator, a tolerance tol > 0, a stopping criterion STOP.
  $x \leftarrow x_0$, $l \leftarrow 1$
  repeat
    $\hat{x} \leftarrow x - \frac{1}{L}\nabla g(x)$              {Gradient Step}
    $z^0 \leftarrow 0$
    for $i = 1$ to $l$ do
      $z^i = A_{in}(\hat{x}, z^{i-1})$             {Proximal Step}
    end for
    $\hat{x} = z^l$
    if $f(x) - f(\hat{x}) < \mathrm{tol}\, f(x)$ then
      $l \leftarrow l + 1$                      {Increase proximal iterations}
    end if
    $x = A_{out}(x, \hat{x})$                   {Basic or accelerated update}
  until STOP is met



Beyond the simplicity of the algorithm (no parameter except for the tolerance, no need to set a global
accuracy in advance), SIP leverages the observation that a constant number of inner iterations $l$ only allows
reaching some underlying accuracy. As long as this accuracy has not been reached, it is not necessary to
put more effort into estimating the proximity operator. The rough idea is that far from the minimum of
a convex function, moving along a rough estimate of the steepest direction is very likely to make
the function decrease fast enough, hence the low precision required for the proximal point estimation.
On the other hand, when close to the minimum, a much higher precision is required, hence the need for
using more inner iterations. This point of view meets the one developed in [5] in the context of stochastic
optimization, where the authors suggest using increasing batch sizes (along the optimization procedure)
for the stochastic estimation of the gradient of the functional to minimize, in order to achieve computational
efficiency.
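
The SIP strategy can be sketched in a few lines of Python. The example below is only an illustration on a scalar toy problem, not the implementation used in the experiments. The problem ($g(x) = \frac{1}{2}(x-3)^2$, $h(x) = \frac{1}{4}x^4$), the constant `LIPSCHITZ = 2.0`, the inner gradient step of size 0.1, the warm start of the inner solver at the current iterate and the fixed budget of outer iterations used as stopping criterion are all assumptions made for the sketch; the basic update $x \leftarrow \hat{x}$ is used.

```python
def sip(x0, grad_g, f, inner_step, lipschitz, tol=1e-8, max_outer=200):
    """Sketch of SIP with the basic update, on a scalar problem."""
    x, l = x0, 1
    for _ in range(max_outer):            # STOP: fixed budget of outer iterations
        z = x - grad_g(x) / lipschitz     # gradient step
        u = x                             # warm start of the inner solver (assumption)
        for _ in range(l):                # l inner iterations on the proximal problem
            u = inner_step(u, z)
        if f(x) - f(u) < tol * f(x):      # objective no longer decreases fast enough
            l += 1                        # use more inner iterations from now on
        x = u                             # basic (non-accelerated) update
    return x, l

# Toy problem: f(x) = g(x) + h(x) with g(x) = (x - 3)^2 / 2 and h(x) = x^4 / 4.
LIPSCHITZ = 2.0   # any upper bound on the Lipschitz constant of g' (here 1) works

def f(x):
    return 0.5 * (x - 3.0) ** 2 + 0.25 * x ** 4

def grad_g(x):
    return x - 3.0

def inner_step(u, z, step=0.1):
    # One gradient step on the proximal objective (L/2)(u - z)^2 + u^4 / 4.
    return u - step * (LIPSCHITZ * (u - z) + u ** 3)

x_final, l_final = sip(0.0, grad_g, f, inner_step, LIPSCHITZ)
print(x_final, l_final)
```

On this toy problem the iterates approach the minimizer of $f$, the root of $x + x^3 = 3$, and the number of inner iterations only starts growing once the relative decrease of the objective falls below the tolerance.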

5 Numerical Simulations 

The objective of this section is to empirically investigate the behaviour of proximal-gradient methods when
the proximity operator is estimated via a fixed number of iterations. We also assess the performance of
the proposed SIP algorithm. Our expectation is that a strategy with just one internal iteration will be
computationally optimal only up to a certain accuracy, after which using two internal iterations will be
more efficient and so on. We consider an image deblurring problem with total variation regularization and
a semi-supervised learning problem using two sublinear methods for computing the proximity operator.

Figure 2: Deblurring with Total Variation - Basic method (left) and Accelerated method (right)



5.1 TV-regularization for image deblurring 

The problem of denoising or deblurring an image is often tackled via Total Variation regularization
[23 HQl IS] . The total variation regularizer allows one to preserve sharp edges and is defined as 

N 

9{x) = A ||(Vi)^|| a 

where A > is a regularization parameter and V is the discrete gradient operator [TU] . We use the smooth 
quadratic data fit term f(x) = \\Ax — y\\^, where A is a linear blurring operator and y is the image to be 
deblurred. This leads to the following problem: 

JV 

mm\\Ax - yg + X ^ ||(Vx) 2J || 2 . 

i,j=l 
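To make the objective above concrete, here is a minimal Python sketch that evaluates it on a small image stored as a list of rows. The function name `tv_objective`, the callable `blur` standing in for the operator $A$, and the forward-difference convention with zero differences on the last row and column are all assumptions made for illustration; the text does not specify how the discrete gradient is discretized.

```python
import math


def tv_objective(x, y, blur, lam):
    """Evaluate ||A x - y||_2^2 + lam * sum_{i,j} ||(grad x)_{i,j}||_2.

    x, y : images as lists of rows of floats (same shape).
    blur : callable standing in for the linear operator A (hypothetical helper).
    lam  : regularization parameter.
    """
    n_rows, n_cols = len(x), len(x[0])
    ax = blur(x)
    data_fit = sum(
        (ax[i][j] - y[i][j]) ** 2 for i in range(n_rows) for j in range(n_cols)
    )
    tv = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Forward differences; assumed to be zero on the last row/column.
            dv = x[i + 1][j] - x[i][j] if i + 1 < n_rows else 0.0
            dh = x[i][j + 1] - x[i][j] if j + 1 < n_cols else 0.0
            tv += math.hypot(dv, dh)  # Euclidean norm of the local gradient
    return data_fit + lam * tv
```

With the identity in place of the blur, a constant image has zero objective, while a vertical edge contributes one unit of total variation per row it crosses.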

Our experimental setup follows the one in [30], where it was used for an asymptotic analysis. We start with
the famous Lena test image, scaled to 256 x 256 pixels. A 9 x 9 Gaussian filter with standard deviation 4
is used to blur the image. Normal noise with zero mean and standard deviation $10^{-3}$ is also added. The
regularization parameter $\lambda$ was set to $10^{-4}$. We run the basic proximal-gradient method up to a total
computational cost of $C = 10^6$ (where we set $C_{in} = C_{out} = 1$) and the accelerated method up to a cost of
$5 \times 10^4$. We computed the proximity operator using the algorithm of [5], which is a basic proximal-gradient
method applied to the dual of the proximity operator problem. We used a fixed number of iterations and
compared with the convergent strategy proposed in [28] and the SIP algorithm with tolerance $10^{-8}$. As
a reference for the optimal value of the objective function, we used the minimum value achieved by any
method (i.e. the SIP algorithm in all cases) and reported the results in Fig. [2].

As the figures display a similar behaviour for the different problems we ran our simulations on, we
defer the analysis of the results to Section 5.3.



5.2 Graph prediction 

The second simulation is on the graph prediction setting of [18]. It consists in a sequential prediction
of boolean labels on the vertices of a graph, the learner's goal being the minimization of the number of
mistakes. More specifically, we consider a 1-seminorm on the space of graph labellings, which corresponds
to the minimization of the following problem (composite $\ell_1$ norm)

\[ \min_x \|Ax - y\|^2 + \lambda \|Bx\|_1, \]

where $A$ is a linear operator that selects only the vertices for which we have labels $y$, $B$ is the edge map
of the graph and $\lambda > 0$ is a regularization parameter (set to $10^{-4}$). We constructed a synthetic graph of
$d = 100$ vertices, with two clusters of equal size. The edges in each cluster were selected from a uniform
draw with probability $\frac{1}{2}$ and we explicitly connected $d/25$ pairs of vertices between the clusters. The
labelled data $y$ were the cluster labels (+1 or -1) of $s = 10$ randomly drawn vertices. We compute
the proximity operator of $\lambda\|Bx\|_1$ via the method proposed in [15], which essentially is a basic proximal
method on the dual of the proximity operator problem. We follow the same experimental protocol as in
the total variation problem and report the results in Fig. [3].

Figure 3: Graph learning - Basic method (left) and Accelerated method (right); horizontal axis: Computational Cost
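The synthetic setting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `make_two_cluster_graph` and `graph_objective` are hypothetical, the within-cluster edge probability `p_in` is a free parameter of the sketch, and the edge map is taken to be $(Bx)_e = x_u - x_v$ for an edge $e = (u, v)$.

```python
import random


def make_two_cluster_graph(d=100, p_in=0.5, seed=0):
    """Return a sorted edge list for two equal-size clusters of a d-vertex graph."""
    rng = random.Random(seed)
    half = d // 2
    edges = set()
    # Random edges inside each cluster, each kept with probability p_in.
    for offset in (0, half):
        for u in range(offset, offset + half):
            for v in range(u + 1, offset + half):
                if rng.random() < p_in:
                    edges.add((u, v))
    # Explicitly connect d/25 distinct pairs of vertices across the clusters.
    cross = set()
    while len(cross) < d // 25:
        cross.add((rng.randrange(0, half), rng.randrange(half, d)))
    return sorted(edges | cross)


def graph_objective(x, labelled, edges, lam):
    """Evaluate ||A x - y||^2 + lam * ||B x||_1 for a labelling x.

    labelled : dict mapping a labelled vertex to its label y_v (plays the role of A and y).
    edges    : list of (u, v) pairs defining the edge map B (assumed form x_u - x_v).
    """
    fit = sum((x[v] - y_v) ** 2 for v, y_v in labelled.items())
    penalty = sum(abs(x[u] - x[v]) for u, v in edges)
    return fit + lam * penalty
```

For the true cluster labelling, the data-fit term vanishes and only the cross-cluster edges contribute to the penalty, each by $|{+1} - ({-1})| = 2$.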

5.3 Why the "computationally optimal" strategies are good but not that optimal

On all the displayed results (Fig. [2] and [3]), and as the theory predicted, we can see that for almost any
given accuracy $\rho$ (i.e. $F_k - F^*$ on the figures), there exists some constant value for $l_i$ that yields a strategy
that is potentially orders of magnitude more efficient than the strategy that ensures the fastest global
convergence rate. On any of the figures, comparing the curves obtained with 1 and 2 inner iterations,
one may notice that the former first increases the precision faster than the latter. Meanwhile, the former
eventually converges to a higher plateau than the latter. This observation still holds as the number of
constant iterations increases. This highlights the fact that smaller constant values of $l_i$ lead to faster
algorithms at the cost of a worse global precision. On the other hand, the SIP strategy seems to almost
always be the fastest strategy to reach any desired precision. That makes it the most computationally
efficient strategy, as the figures show. This may look surprising, as the constant-$l_i$ strategies are supposed
to be optimal for a specific precision and obviously are not.

In fact, there is no contradiction with the theory: keeping $l_i$ constant leads to the optimal strategies
for minimizing a bound on the real error, which can be significantly different from directly minimizing the
error.

This remark raises crucial issues. If the bound we use for the error was a perfect description of the
real error, the strategies with constant $l_i$ would be the best also in practice. Intuitively, the tighter the
bounds, the closer our theoretical optimal strategy will be to the actual optimal one. This intuition
is corroborated by our numerical experiments. In our parametrization of $\epsilon_i$, in a first approximation,
we decided to consider constant $A_i$ (see equation ([7])). When not using warm restarts between two
consecutive outer iterations, our model of $\epsilon_i$ does describe the actual behaviour much more accurately
and our theoretical optimal strategy seems much closer to the real optimal one. To take warm starts into
account, one would need to consider decreasing sequences of $A_i$'s. Doing so, one can notice that in the
first 3 scenarios, the optimal strategies would not consist in using a constant number of inner iterations any
longer, but only constant $\epsilon_i$'s, hence maintaining the same gap between optimal rates and computationally
optimal strategies.

These ideas call for a finer understanding of how optimization algorithms behave in practice. Our
claim is that one pivotal key to designing practically efficient algorithms is to have new tools such as
warm-start analysis and, perhaps more importantly, convergence bounds that are tighter for specific problems
(i.e. "specific-case" analysis rather than the usual "worst-case" ones).

6 Conclusion and future work

We analysed the computational cost of proximal-gradient methods when the proximity operator is
computed numerically. Building upon the results in [28], we proved that optimization strategies using a
constant number of inner iterations can have very significant impacts on computational efficiency, at the
cost of obtaining only a suboptimal solution. Our numerical experiments showed that these strategies do
exist in practice, although it might be difficult to access them. Coincidentally, those theoretical strategies
meet those of actual implementations and widely-used packages and help us understand both their
efficiency and limitations. We also proposed a novel optimization strategy, the SIP algorithm, that can
bring large computational savings in practice and whose theoretical analysis needs to be further developed
in future studies. Throughout the paper, we highlighted the fact that finite-time analysis, such as ours,
calls for a better understanding of (even standard) optimization procedures. There is a need for sharper
and problem-dependent error bounds, as well as a better theoretical analysis of warm restarts, for instance.

Finally, although we focused on inexact proximal-gradient methods, the present work was inspired
by the paper "The Trade-offs of Large-Scale Learning" [7]. Bottou and Bousquet studied the trade-offs
between computational accuracy and statistical performance of machine learning methods and advocated
sacrificing the rate of convergence of optimization algorithms in favour of lighter computational costs.
At a higher level, future work naturally includes finding other situations where such trade-offs appear and
analysing them using a similar methodology.



7 Appendix 
Proof of Proposition [3] 

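In this scenario, we use non-accelerated outer iterations and sublinear inner iterations. Our optimisation
problem thus reads:

\[ \min_{k} \min_{\{l_i\}_{i=1}^k} C_{in} \sum_{i=1}^k l_i + k C_{out} \quad \text{s.t.} \quad \frac{L}{2k} \left( \|x_0 - x^*\| + 3 \sum_{i=1}^k \sqrt{\frac{2 A_i}{L\, l_i^{\alpha}}} \right)^2 \le \rho. \]

Let us first examine the constraint.

\[ \frac{L}{2k} \left( \|x_0 - x^*\| + 3 \sum_{i=1}^k \sqrt{\frac{2 A_i}{L\, l_i^{\alpha}}} \right)^2 \le \rho \iff \sum_{i=1}^k \sqrt{\frac{A_i}{l_i^{\alpha}}} \le \frac{\sqrt{L}}{3\sqrt{2}} \left( \sqrt{\frac{2 k \rho}{L}} - \|x_0 - x^*\| \right). \]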


As a first remark, this constraint can be satisfied only if

\[ k \ge \frac{L \|x_0 - x^*\|^2}{2\rho}. \]

However, this always holds, as it only means that the number of outer iterations $k$ is larger than the
amount we would need if the proximity operator could be computed exactly.

Let us recall that for any $i$, $A_i$ is such that $\epsilon_i \le A_i / l_i^{\alpha}$. For most iterative optimization methods, the
tightest bounds (of this form) on the error are obtained for constants $A_i$ depending on: a) properties of
the objective function at hand, b) the initialization. To mention an example we have already introduced,
for basic proximal methods, one can choose



A,= 



2h 



\{x k ) Q -x* k \\, 



where $(x_k)_0$ is the initialization for our inner problem at outer iteration $k$ and $x_k^*$ the optimum of this
problem. As the problem seems intractable in the most general case, we will first assume that $\forall i, A_i = A$.
This only implies that we do not introduce any prior knowledge on $\|(x_k)_0 - x_k^*\|$ at each iteration. This
is reasonable if, at each outer iteration, we randomly initialize $(x_k)_0$, but may lead to looser bounds if we
use wiser strategies such as warm starts.

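With that new assumption on $A_i$, one can state that the former constraint will hold if and only if:

\[ \sum_{i=1}^k \sqrt{\frac{1}{l_i^{\alpha}}} \le \frac{\sqrt{L}}{3\sqrt{2A}} \left( \sqrt{\frac{2 k \rho}{L}} - \|x_0 - x^*\| \right). \]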



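Let us first solve the problem of finding the $\{l_i\}_{i=1}^k$ for some fixed $k$. We need to solve:

\[ \operatorname*{argmin}_{\{l_i\}_{i=1}^k \in \mathbb{N}^{*k}} C_{in} \sum_{i=1}^k l_i + k C_{out} \quad \text{s.t.} \quad \sum_{i=1}^k \sqrt{\frac{1}{l_i^{\alpha}}} \le \frac{\sqrt{L}}{3\sqrt{2A}} \left( \sqrt{\frac{2 k \rho}{L}} - \|x_0 - x^*\| \right) =: C_k, \]

which is equivalent to solving:

\[ \operatorname*{argmin}_{\{l_i\}_{i=1}^k \in \mathbb{N}^{*k}} \sum_{i=1}^k l_i \quad \text{s.t.} \quad \sum_{i=1}^k \sqrt{\frac{1}{l_i^{\alpha}}} \le C_k. \]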

Remark 4. $l_i \in \mathbb{N}^* \Rightarrow \frac{1}{\sqrt{l_i^{\alpha}}} \in ]0, 1] \Rightarrow \sum_{i=1}^k \frac{1}{\sqrt{l_i^{\alpha}}} \le k$. So, if $C_k \ge k$ then the solution of the constrained
problem is the solution of the unconstrained problem. In that case, the trivial solution is $l_i = 1, \forall i$.
Moreover, if $l_i = 1, \forall i$ is the solution of the constrained problem, then $\sum_{i=1}^k \frac{1}{\sqrt{l_i^{\alpha}}} = k \le C_k$. As a
consequence, the solution of the unconstrained problem is the solution of the constrained problem if and
only if $C_k \ge k$.

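We then have two cases to consider:

Case 1: $C_k \ge k$. As stated before, the optimum will be trivially reached for $l_i = 1, \forall i$. Now, we need to
find the optimal over $k$. It consists in finding:

\[ \min_{k \in \mathbb{N}^*} k (C_{in} + C_{out}) \quad \text{s.t.} \quad C_k \ge k. \]

Let us have a look at the constraint.

\begin{align*}
C_k \ge k &\iff \frac{\sqrt{L}}{3\sqrt{2A}} \left( \sqrt{\frac{2 k \rho}{L}} - \|x_0 - x^*\| \right) \ge k \\
&\iff \sqrt{\frac{2 k \rho}{L}} \ge \frac{3\sqrt{2A}}{\sqrt{L}} k + \|x_0 - x^*\| \\
&\iff \left( \sqrt{k} - \frac{\sqrt{\rho}}{6\sqrt{A}} \right)^2 \le \frac{\rho}{36A} - \frac{\sqrt{L}\|x_0 - x^*\|}{3\sqrt{2A}}.
\end{align*}

Then:

• if $\frac{\rho}{36A} < \frac{\sqrt{L}\|x_0 - x^*\|}{3\sqrt{2A}}$, then there is no solution (i.e. $C_k < k, \forall k$).

• if $\frac{\rho}{36A} \ge \frac{\sqrt{L}\|x_0 - x^*\|}{3\sqrt{2A}}$, then the constraint holds for

\[ k \in \left[ \left( \frac{\sqrt{\rho}}{6\sqrt{A}} - \sqrt{\frac{\rho}{36A} - \frac{\sqrt{L}\|x_0 - x^*\|}{3\sqrt{2A}}} \right)^2, \left( \frac{\sqrt{\rho}}{6\sqrt{A}} + \sqrt{\frac{\rho}{36A} - \frac{\sqrt{L}\|x_0 - x^*\|}{3\sqrt{2A}}} \right)^2 \right]. \]
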
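The optimum will then be achieved for the smallest integer (if it exists) larger than
$\left( \frac{\sqrt{\rho}}{6\sqrt{A}} - \sqrt{\frac{\rho}{36A} - \frac{\sqrt{L}\|x_0 - x^*\|}{3\sqrt{2A}}} \right)^2$ for which the constraint still holds.

Case 2: $C_k < k$. As Remark [4] shows, the solution of the constrained problem is different from the
unconstrained one. The solution of this integer optimization problem is hard to compute. In a first step,
we may relax the problem and solve it as if $\{l_i\}_{i=1}^k$ were continuous variables taking values in $[1, +\infty[^k$.
Because both our objective function and the constraints are continuous with respect to $\{l_i\}_{i=1}^k$, the optimum
(over $\{l_i\}_{i=1}^k$) of our problem will precisely lie on the constraint. Our problem now is:

\[ \operatorname*{argmin}_{\{l_i\}_{i=1}^k \in [1, +\infty[^k} \sum_{i=1}^k l_i \quad \text{s.t.} \quad \sum_{i=1}^k \sqrt{\frac{1}{l_i^{\alpha}}} = C_k. \]

For any $i \in [1, k]$, let $n_i := l_i^{-\alpha/2}$. Our problem becomes:

\[ \operatorname*{argmin}_{\{n_i\}_{i=1}^k \in ]0, 1]^k} \sum_{i=1}^k n_i^{-2/\alpha} \quad \text{s.t.} \quad \sum_{i=1}^k n_i = C_k. \]

Introducing the Lagrange multiplier $\lambda \in \mathbb{R}$, the Lagrangian of this problem writes:

\[ \mathcal{L}(\{n_i\}_{i=1}^k, \lambda) := \sum_{i=1}^k n_i^{-2/\alpha} + \lambda \left( \sum_{i=1}^k n_i - C_k \right). \]

And it follows that, $\forall i \in [1, k]$, when the optimum $\{n_i^*\}_{i=1}^k$ is reached:

\[ \frac{\partial \mathcal{L}}{\partial n_i}(n_i^*) = 0 \iff n_i^* = \left( \frac{\alpha \lambda}{2} \right)^{-\frac{\alpha}{\alpha + 2}}. \]

And now, plugging into our constraint:

\[ \sum_{i=1}^k \left( \frac{\alpha \lambda}{2} \right)^{-\frac{\alpha}{\alpha + 2}} = C_k. \]

Hence, for any $i \in [1, k]$, $n_i^* = \frac{C_k}{k}$.

As $C_k < k$, it is clear that $\forall i$, $n_i^* \in ]0, 1]$ and we have, $\forall i$, $l_i^* = \left( \frac{C_k}{k} \right)^{-2/\alpha}$.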
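As a rough numerical illustration of the two cases, the following Python sketch evaluates $C_k$ and the relaxed inner-iteration count for a fixed $k$. The function names are hypothetical, the formulas assume the constant model $A_i = A$ and real-valued $l_i$ used in this proof, and the handling of a non-positive $C_k$ (returning `None`) is a convention of the sketch rather than something stated in the text.

```python
import math


def c_k(k, rho, L, A, dist0):
    """C_k = sqrt(L) / (3 sqrt(2A)) * (sqrt(2 k rho / L) - ||x_0 - x*||).

    dist0 stands for ||x_0 - x*|| (assumed known for the illustration).
    """
    return math.sqrt(L) / (3.0 * math.sqrt(2.0 * A)) * (
        math.sqrt(2.0 * k * rho / L) - dist0
    )


def relaxed_inner_iterations(k, rho, L, A, dist0, alpha):
    """Constant inner-iteration count solving the relaxed problem for fixed k."""
    ck = c_k(k, rho, L, A, dist0)
    if ck <= 0.0:
        return None  # the accuracy rho cannot be reached with k outer iterations
    if ck >= k:
        return 1.0  # Case 1: one inner iteration per outer iteration suffices
    return (ck / k) ** (-2.0 / alpha)  # Case 2: l_i* = (C_k / k)^(-2 / alpha)
```

In Case 2 the returned value makes the constraint tight, that is, $k \, (l^*)^{-\alpha/2} = C_k$; rounding it up to an integer keeps the constraint satisfied.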

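We can now plug the optimal $l_i^*$ in our first problem and we now need to find the optimal $k^*$ such that:

\begin{align*}
k^* &= \operatorname*{argmin}_{k \in \mathbb{N}^*} C_{glob}(k, \{l_i^*\}_{i=1}^k) \\
&= \operatorname*{argmin}_{k \in \mathbb{N}^*} C_{in} \sum_{i=1}^k l_i^* + k C_{out} \\
&= \operatorname*{argmin}_{k \in \mathbb{N}^*} k C_{in} \left( \frac{C_k}{k} \right)^{-2/\alpha} + k C_{out} \\
&= \operatorname*{argmin}_{k \in \mathbb{N}^*} k \left( C_{in} \left( \frac{C_k}{k} \right)^{-2/\alpha} + C_{out} \right).
\end{align*}

Once again, we can relax this integer optimization problem into a continuous one, assuming $k \in \mathbb{R}^+$.
It directly follows that the solution of that relaxed problem is reached when the derivative (w.r.t. $k$) of
$C_{glob}(k, \{l_i^*\}_{i=1}^k)$ equals 0. The derivative can be easily computed:

\[ \frac{d C_{glob}(k, \{l_i^*\}_{i=1}^k)}{dk} = C_{in} \left( \frac{C_k}{k} \right)^{-2/\alpha} \left( 1 + \frac{2}{\alpha} - \frac{2}{\alpha} \frac{k C'_k}{C_k} \right) + C_{out}, \]

where $C'_k$ is the derivative of $C_k$ w.r.t. $k$:

\[ C'_k = \frac{1}{3\sqrt{A}} \cdot \frac{\sqrt{\rho}}{2\sqrt{k}}. \]

However, giving an analytic form of that zero is difficult. But using any numerical solver, it is very easy to
find a very good approximation of $k^*$. As described in Remark [3], this allows us to retrieve the
exact integer minimizer.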
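The search for $k^*$ can also be illustrated without a root finder. The sketch below swaps the numerical solver mentioned above for a plain scan over integer values of $k$, which is a simplification chosen for the example and not the procedure of the text. The function names, the default costs `c_in = c_out = 1`, and the cap `k_max` are assumptions; the inner-iteration count is left real-valued, as in the relaxation.

```python
import math


def global_cost(k, rho, L, A, dist0, alpha, c_in=1.0, c_out=1.0):
    """Relaxed global cost k * (c_in * l* + c_out) for a fixed number k of outer iterations."""
    ck = math.sqrt(L) / (3.0 * math.sqrt(2.0 * A)) * (
        math.sqrt(2.0 * k * rho / L) - dist0
    )
    if ck <= 0.0:
        return math.inf  # accuracy rho is unreachable with only k outer iterations
    l_star = 1.0 if ck >= k else (ck / k) ** (-2.0 / alpha)
    return k * (c_in * l_star + c_out)


def best_outer_iterations(rho, L, A, dist0, alpha, k_max=100000):
    """Brute-force integer minimizer of the relaxed global cost over 1..k_max."""
    return min(
        range(1, k_max + 1),
        key=lambda k: global_cost(k, rho, L, A, dist0, alpha),
    )
```

A scan is affordable here because each cost evaluation is constant-time; for very large `k_max`, seeding the scan with the zero of the derivative would be the natural refinement.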



Technical Report V 2.0 






Proof of Proposition [4] 



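In this scenario, we use non-accelerated outer iterations and linear inner iterations. Our optimisation
problem thus reads:

\[ \min_{k} \min_{\{l_i\}_{i=1}^k} C_{in} \sum_{i=1}^k l_i + k C_{out} \quad \text{s.t.} \quad \frac{L}{2k} \left( \|x_0 - x^*\| + 3 \sum_{i=1}^k \sqrt{\frac{2 A_i (1-\gamma)^{l_i}}{L}} \right)^2 \le \rho. \]

We consider $A_i = A$. The error in the $i$th inner iteration reads:

\[ \epsilon_i = A (1-\gamma)^{l_i}. \tag{14} \]

Hence the corresponding bound on the error:

\[ \rho_k \le \frac{L}{2k} \left( \|x_0 - x^*\| + 3 \sum_{i=1}^k \sqrt{\frac{2 A (1-\gamma)^{l_i}}{L}} \right)^2. \tag{15} \]

The problem in $\{l_i\}$ boils down to:

\[ \operatorname*{argmin}_{\{l_i\}_{i=1}^k \in \mathbb{N}^{*k}} \sum_{i=1}^k l_i \quad \text{s.t.} \quad \sum_{i=1}^k (1-\gamma)^{l_i/2} \le C_k, \]

still with $C_k = \frac{\sqrt{L}}{3\sqrt{2A}} \left( \sqrt{\frac{2 k \rho}{L}} - \|x_0 - x^*\| \right)$.

Case 1: $C_k \ge k\sqrt{1-\gamma}$. This case is identical except for the threshold, which will also impact the interval for $k^*$.

Case 2: $C_k < k\sqrt{1-\gamma}$. For any $i \in [1, k]$, let $n_i := (1-\gamma)^{l_i/2}$. Our problem becomes:

\[ \operatorname*{argmin}_{\{n_i\}_{i=1}^k} - \sum_{i=1}^k \ln n_i \quad \text{s.t.} \quad \sum_{i=1}^k n_i = C_k. \]

Writing again the Lagrangian of this new problem, we obtain the same result: for any $i \in [1, k]$,

\[ n_i^* = \frac{C_k}{k}. \]

This leads to

\[ l_i^* = \frac{2 \ln\left( \frac{C_k}{k} \right)}{\ln(1-\gamma)}. \]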



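Following the same reasoning, we now plug this analytic solution of the first optimization problem into
the second one. This leads to:

\[ k^* = \operatorname*{argmin}_{k \in \mathbb{N}^*} k \left( \frac{2 C_{in}}{\ln(1-\gamma)} \ln\left( \frac{C_k}{k} \right) + C_{out} \right). \]

This time, the derivative of the continuous relaxation writes:

\[ \frac{d C_{glob}(k, \{l_i^*\}_{i=1}^k)}{dk} = \frac{2 C_{in}}{\ln(1-\gamma)} \left( \ln\left( \frac{C_k}{k} \right) + \frac{k C'_k}{C_k} - 1 \right) + C_{out}, \]

where $C'_k$ is the derivative of $C_k$ w.r.t. $k$:

\[ C'_k = \frac{1}{3\sqrt{A}} \cdot \frac{\sqrt{\rho}}{2\sqrt{k}}. \]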
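The closed-form derivative can be sanity-checked against a finite difference. The Python sketch below is only such a check under the assumptions of this proof (constant $A_i = A$, real-valued $k$, Case 2 regime); the function names and the parameter values used to exercise it are illustrative choices, not taken from the experiments.

```python
import math


def c_k(k, rho, L, A, dist0):
    """C_k = sqrt(L) / (3 sqrt(2A)) * (sqrt(2 k rho / L) - ||x_0 - x*||)."""
    return math.sqrt(L) / (3.0 * math.sqrt(2.0 * A)) * (
        math.sqrt(2.0 * k * rho / L) - dist0
    )


def c_k_prime(k, rho, A):
    """Derivative of C_k with respect to k: sqrt(rho) / (6 sqrt(A k))."""
    return math.sqrt(rho) / (6.0 * math.sqrt(A * k))


def global_cost(k, rho, L, A, dist0, gamma, c_in=1.0, c_out=1.0):
    """k * (2 c_in / ln(1 - gamma) * ln(C_k / k) + c_out), valid when 0 < C_k < k sqrt(1 - gamma)."""
    ck = c_k(k, rho, L, A, dist0)
    return k * (2.0 * c_in / math.log(1.0 - gamma) * math.log(ck / k) + c_out)


def global_cost_derivative(k, rho, L, A, dist0, gamma, c_in=1.0, c_out=1.0):
    """Closed-form derivative of global_cost with respect to k."""
    ck = c_k(k, rho, L, A, dist0)
    ratio = k * c_k_prime(k, rho, A) / ck
    return 2.0 * c_in / math.log(1.0 - gamma) * (math.log(ck / k) + ratio - 1.0) + c_out
```

Comparing `global_cost_derivative` with a central difference of `global_cost` at a point in the Case 2 regime gives agreement up to discretization error, which supports the expression above.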

The optimum $k^*$ of our problem is the (unique) zero of that derivative.

Proof of Proposition [5]



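In this scenario, we use accelerated outer iterations and sublinear inner iterations. Our optimisation
problem thus reads:

\[ \min_{k} \min_{\{l_i\}_{i=1}^k} C_{in} \sum_{i=1}^k l_i + k C_{out} \quad \text{s.t.} \quad \frac{2L}{(k+1)^2} \left( \|x_0 - x^*\| + 3 \sum_{i=1}^k i \sqrt{\frac{2 A_i}{L\, l_i^{\alpha}}} \right)^2 \le \rho. \]

We consider $A_i = A$. The error in the $i$th inner iteration reads:

\[ \epsilon_i = \frac{A}{l_i^{\alpha}}. \tag{16} \]

Similarly, for the accelerated case, we have:

\[ \rho_k \le \frac{2L}{(k+1)^2} \left( \|x_0 - x^*\| + 3 \sum_{i=1}^k i \sqrt{\frac{2 \epsilon_i}{L}} \right)^2. \tag{17} \]

Those problems can naturally be extended with the use of accelerated schemes and we get this
"error-oriented" problem:

\[ \min_{k, \{l_i\}_{i=1}^k} C_{in} \sum_{i=1}^k l_i + k C_{out} \quad \text{s.t.} \quad \frac{2L}{(k+1)^2} \left( \|x_0 - x^*\| + 3 \sum_{i=1}^k i \sqrt{\frac{2 \epsilon_i}{L}} \right)^2 \le \rho. \]

We will follow the same reasoning as for the non-accelerated case. We will consider this optimization
problem:

\[ \min_{k, \{l_i\}_{i=1}^k} C_{in} \sum_{i=1}^k l_i + k C_{out} \quad \text{s.t.} \quad \frac{2L}{(k+1)^2} \left( \|x_0 - x^*\| + 3 \sum_{i=1}^k i \sqrt{\frac{2 A_i}{L\, l_i^{\alpha}}} \right)^2 \le \rho. \]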



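Let us first have a look at the constraint.

\begin{align*}
& \frac{2L}{(k+1)^2} \left( \|x_0 - x^*\| + 3 \sum_{i=1}^k i \sqrt{\frac{2 A_i}{L\, l_i^{\alpha}}} \right)^2 \le \rho \\
\iff & \|x_0 - x^*\| + 3 \sum_{i=1}^k i \sqrt{\frac{2 A_i}{L\, l_i^{\alpha}}} \le \sqrt{\frac{\rho}{2L}} (k+1) \\
\iff & \sum_{i=1}^k i \sqrt{\frac{A_i}{l_i^{\alpha}}} \le \frac{\sqrt{L}}{3\sqrt{2}} \left( \sqrt{\frac{\rho}{2L}} (k+1) - \|x_0 - x^*\| \right).
\end{align*}

As in the former case, this can only hold if $(k+1) \ge \sqrt{\frac{2L}{\rho}} \|x_0 - x^*\|$, which is trivial.

We will now assume again that $A_i = A$ for any $i$. As earlier, we first solve the following problem in $\{l_i\}$:

\[ \operatorname*{argmin}_{\{l_i\}_{i=1}^k \in \mathbb{N}^{*k}} \sum_{i=1}^k l_i \quad \text{s.t.} \quad \sum_{i=1}^k i \sqrt{\frac{1}{l_i^{\alpha}}} \le \frac{\sqrt{L}}{3\sqrt{2A}} \left( \sqrt{\frac{\rho}{2L}} (k+1) - \|x_0 - x^*\| \right) =: D_k. \]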



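Remark 5. $l_i \in \mathbb{N}^* \Rightarrow \frac{1}{\sqrt{l_i^{\alpha}}} \in ]0, 1] \Rightarrow \sum_{i=1}^k \frac{i}{\sqrt{l_i^{\alpha}}} \le \frac{k(k+1)}{2}$. So, if $D_k \ge \frac{k(k+1)}{2}$ then the solution of the
constrained problem is the solution of the unconstrained problem. In that case, the trivial solution is
$l_i = 1, \forall i$. Moreover, if $l_i = 1, \forall i$ is the solution of the constrained problem, then $\sum_{i=1}^k \frac{i}{\sqrt{l_i^{\alpha}}} = \frac{k(k+1)}{2} \le D_k$.
As a consequence, the solution of the unconstrained problem is the solution of the constrained problem if
and only if $D_k \ge \frac{k(k+1)}{2}$.

Case 1: $D_k \ge \frac{k(k+1)}{2}$. As stated before, the optimum will be trivially reached for $l_i = 1, \forall i$. Now, we
need to find the optimal over $k$. It consists in finding:

\[ \min_{k \in \mathbb{N}^*} k (C_{in} + C_{out}) \quad \text{s.t.} \quad D_k \ge \frac{k(k+1)}{2}. \]

Let us have a look at this constraint.

\begin{align*}
D_k \ge \frac{k(k+1)}{2} &\iff \frac{\sqrt{L}}{3\sqrt{2A}} \left( \sqrt{\frac{\rho}{2L}} (k+1) - \|x_0 - x^*\| \right) \ge \frac{k(k+1)}{2} \\
&\iff k^2 + k \left( 1 - \frac{\sqrt{\rho}}{3\sqrt{A}} \right) \le \frac{\sqrt{\rho}}{3\sqrt{A}} - \frac{\sqrt{2L}\|x_0 - x^*\|}{3\sqrt{A}} \\
&\iff \left( k + \frac{1}{2} \left( 1 - \frac{\sqrt{\rho}}{3\sqrt{A}} \right) \right)^2 \le K,
\end{align*}

where $K := \frac{1}{4} \left( 1 + \frac{\sqrt{\rho}}{3\sqrt{A}} \right)^2 - \frac{\sqrt{2L}\|x_0 - x^*\|}{3\sqrt{A}}$.

Then:

• if $K < 0$, then there is no solution (i.e. $D_k < \frac{k(k+1)}{2}, \forall k$).

• if $K \ge 0$, then the constraint holds for $k \in \left[ \frac{1}{2} \left( \frac{\sqrt{\rho}}{3\sqrt{A}} - 1 \right) - \sqrt{K}, \frac{1}{2} \left( \frac{\sqrt{\rho}}{3\sqrt{A}} - 1 \right) + \sqrt{K} \right]$. The optimum
will then be achieved for the smallest integer (if it exists) larger than $\frac{1}{2} \left( \frac{\sqrt{\rho}}{3\sqrt{A}} - 1 \right) - \sqrt{K}$ and smaller
than $\frac{1}{2} \left( \frac{\sqrt{\rho}}{3\sqrt{A}} - 1 \right) + \sqrt{K}$.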

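Case 2: $D_k < \frac{k(k+1)}{2}$. Once again, we fall in the same scenario as in the non-accelerated case. The
solution of our problem is different from the unconstrained one and we can relax our discrete optimization
problem to a continuous one. The optimum then again lies precisely on the constraint. We now have:

\[ \min_{\{l_i\}_{i=1}^k} \sum_{i=1}^k l_i \quad \text{s.t.} \quad \sum_{i=1}^k i \sqrt{\frac{1}{l_i^{\alpha}}} = D_k. \]

For any $i \in [1, k]$, let $n_i := i\, l_i^{-\alpha/2}$. Our problem becomes:

\[ \min_{\{n_i\}_{i=1}^k} \sum_{i=1}^k \left( \frac{n_i}{i} \right)^{-2/\alpha} \quad \text{s.t.} \quad \sum_{i=1}^k n_i = D_k. \]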



The Lagrangian writes: 



i=i 

k 



i=l 



And it follows that, Vi € [1, fc], when the optimum {n*}^ =1 is reached: 



0£ n , .fa\\^W 



And now, plugging into our constraint: 

k 



Hence, for any i G [1, fc], n* = fc 2 fc ^ fc 1 ^ , giving the corresponding Z* = 
We can now plug the optimal I* in our first problem and we now need to find the optimal fc* such that: 

fc* = argminC g iob(fc, 



kef 



fc(fc + 1) 



Once again, we can relax this integer optimization problem into a continuous one, assuming $k \in \mathbb{R}^+$. It
directly follows that the solution of that relaxed problem is reached when the derivative (w.r.t. $k$) of
$C_{glob}(k, \{l_i^*\}_{i=1}^k)$ equals 0.

Proof of Proposition [6]



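In this scenario, we use accelerated outer iterations and linear inner iterations. Our optimisation problem
thus reads:

\[ \min_{k} \min_{\{l_i\}_{i=1}^k} C_{in} \sum_{i=1}^k l_i + k C_{out} \quad \text{s.t.} \quad \frac{2L}{(k+1)^2} \left( \|x_0 - x^*\| + 3 \sum_{i=1}^k i \sqrt{\frac{2 A_i (1-\gamma)^{l_i}}{L}} \right)^2 \le \rho. \]

We consider $A_i = A$. The error in the $i$th inner iteration reads:

\[ \epsilon_i = A (1-\gamma)^{l_i}. \tag{18} \]

The problem in $\{l_i\}$ boils down to:

\[ \operatorname*{argmin}_{\{l_i\}_{i=1}^k \in \mathbb{N}^{*k}} \sum_{i=1}^k l_i \quad \text{s.t.} \quad \sum_{i=1}^k i (1-\gamma)^{l_i/2} \le D_k \tag{19} \]

with $D_k = \frac{\sqrt{L}}{3\sqrt{2A}} \left( \sqrt{\frac{\rho}{2L}} (k+1) - \|x_0 - x^*\| \right)$.

Case 1: $D_k \ge \frac{k(k+1)}{2}\sqrt{1-\gamma}$. This case is identical except for the threshold, which will also impact the
interval for $k^*$.

Case 2: $D_k < \frac{k(k+1)}{2}\sqrt{1-\gamma}$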

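Relaxing Problem (19) to real numbers, we want to solve:

\begin{align}
\operatorname*{argmin}_{\{l_i\}_{i=1}^k \in \mathbb{R}^{+k}} \sum_{i=1}^k l_i \quad \text{s.t.} \quad & \sum_{i=1}^k i (1-\gamma)^{l_i/2} - D_k \le 0, \tag{20} \\
& 1 - l_i \le 0, \quad \forall i. \tag{21}
\end{align}

According to the KKT conditions, there exist $\{\mu_i\}$, $i = 1, .., k$ and $\lambda$, such that the optimum $\{l_i^*\}$
verifies:

\begin{align}
\text{(stationarity)} \quad & 1 + \lambda i (1-\gamma)^{l_i^*/2} \ln(\sqrt{1-\gamma}) - \mu_i = 0, \quad \forall i = 1, .., k \tag{22} \\
\text{(primal feasibility)} \quad & \sum_{i=1}^k i (1-\gamma)^{l_i^*/2} - D_k \le 0, \tag{23} \\
& 1 - l_i^* \le 0, \quad \forall i = 1, .., k \tag{24} \\
\text{(dual feasibility)} \quad & \lambda \ge 0, \tag{25} \\
& \mu_i \ge 0, \quad \forall i = 1, .., k, \tag{26} \\
\text{(complementary slackness)} \quad & \lambda \left( \sum_{i=1}^k i (1-\gamma)^{l_i^*/2} - D_k \right) = 0, \tag{27} \\
& \mu_i (1 - l_i^*) = 0, \quad \forall i = 1, .., k. \tag{28}
\end{align}

Eq. (25) yields two cases: $\lambda = 0$ or $\lambda > 0$.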



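$\lambda = 0$: Then Eq. (22) yields $\mu_i = 1, \forall i$, thus Eq. (28) implies $l_i^* = 1$. All the KKT conditions are thus
fulfilled if Eq. (23) is, i.e. if

\[ \frac{k(k+1)}{2} \sqrt{1-\gamma} \le D_k. \]

We work here in the case where $D_k \le \frac{k(k+1)}{2}\sqrt{1-\gamma}$, thus this solution is valid if and only if $D_k = \frac{k(k+1)}{2}\sqrt{1-\gamma}$.

$\lambda > 0$: Again, Eq. (26) yields two cases: $\mu_i = 0$ or $\mu_i > 0$.

Subcase 1: $\mu_i > 0$

Then by Eq. (28), we have $l_i^* = 1$ and by (22) $\mu_i = 1 + \lambda i \sqrt{1-\gamma} \ln(\sqrt{1-\gamma})$. Then $\mu_i > 0$ implies:

\[ i < \frac{1}{\lambda \sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)}. \]

Subcase 2: $\mu_i = 0$

Then by Eq. (22), we have $1 + \lambda i (1-\gamma)^{l_i^*/2} \ln(\sqrt{1-\gamma}) = 0$, i.e.:

\[ l_i^* = \frac{\ln\left( \lambda i \ln\left( \sqrt{\frac{1}{1-\gamma}} \right) \right)}{\ln\left( \sqrt{\frac{1}{1-\gamma}} \right)}. \]

Since Eq. (24) enforces $l_i^* \ge 1$, we have:

\[ i \ge \frac{1}{\lambda \sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)}. \]

Conclusion: For $\lambda > 0$, Eq. (22), (24), (25), (26) and (28) are fulfilled all at once if we set:

For $i = 1, .., \left\lceil \frac{1}{\lambda \sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)} \right\rceil - 1$: $l_i^* = 1$ and $\mu_i = 1 + \lambda i \sqrt{1-\gamma} \ln(\sqrt{1-\gamma})$.

For $i = \left\lceil \frac{1}{\lambda \sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)} \right\rceil, .., k$: $l_i^* = \frac{\ln\left( \lambda i \ln\left( \sqrt{\frac{1}{1-\gamma}} \right) \right)}{\ln\left( \sqrt{\frac{1}{1-\gamma}} \right)}$ and $\mu_i = 0$.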

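With these values set for $\mu_i$ and $l_i^*$, let us now find the value of $\lambda$.

Computing $\lambda$

We need to fulfill Eq. (23) and (27).
Let us define $M(\lambda) = \left\lceil \frac{1}{\lambda \sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)} \right\rceil$.

Note that for $\lambda \ge \frac{1}{(k+1) \sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)}$, we have: $1 \le M(\lambda) \le k + 1$, and:

\begin{align}
M(\lambda) = 1 &\iff \lambda \ge \frac{1}{\sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)}, \notag \\
M(\lambda) = n &\iff \frac{1}{n \sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)} \le \lambda < \frac{1}{(n-1) \sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)} \quad \text{for } n = 2, .., k + 1. \tag{29}
\end{align}

Eq. (23) and (27) are true if and only if

\[ D_k = \sum_{i=1}^k i (1-\gamma)^{l_i^*/2} = \frac{M(\lambda)(M(\lambda) - 1)}{2} \sqrt{1-\gamma} + \frac{k - M(\lambda) + 1}{\lambda \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)}. \]

We define $F : \mathbb{R}^{+*} \to \mathbb{R}$ by $F(\lambda) = \frac{M(\lambda)(M(\lambda) - 1)}{2} \sqrt{1-\gamma} + \frac{k - M(\lambda) + 1}{\lambda \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)}$.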

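Examining $F$ on each interval where $M$ is constant, it is easy to see that $F$ is continuous and
non-increasing. Moreover $F$ decreases strictly on $\left[ \frac{1}{k \sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)}, \infty \right)$, $\lim_{\lambda \to \infty} F = 0$ and $F$ reaches its
highest value $\max F = \frac{k(k+1)}{2} \sqrt{1-\gamma}$ on $\left[ \frac{1}{(k+1) \sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)}, \frac{1}{k \sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)} \right]$.

We thus have, for all $D_k$ such that $0 < D_k \le \frac{k(k+1)}{2} \sqrt{1-\gamma}$, that there exists a unique $\lambda$ such that $F(\lambda) = D_k$
and thus all KKT conditions are fulfilled.

To find this value of $\lambda$ as a function of $D_k$, we first find $M(\lambda)$ from $D_k$. Notice that

\[ F\left( \frac{1}{n \sqrt{1-\gamma} \ln\left( \sqrt{\frac{1}{1-\gamma}} \right)} \right) = \frac{n (2k + 1 - n)}{2} \sqrt{1-\gamma}. \]

As $D_k < \frac{k(k+1)}{2} \sqrt{1-\gamma}$, there exists a unique integer $n$ in $1, .., k$ such that

\[ \frac{(n-1)(2k + 2 - n)}{2} \sqrt{1-\gamma} < D_k \le \frac{n (2k + 1 - n)}{2} \sqrt{1-\gamma}. \tag{30} \]

Then $M(\lambda) = n$ and the KKT conditions are all fulfilled for:

\[ \lambda = \frac{k + 1 - n}{\ln\left( \sqrt{\frac{1}{1-\gamma}} \right) \left( D_k - \frac{n(n-1)}{2} \sqrt{1-\gamma} \right)}. \tag{31} \]

In particular:

For $i = 1, .., n - 1$: $l_i = 1$.

For $i = n, .., k$: $l_i = \frac{\ln\left( \frac{i (k + 1 - n)}{D_k - \frac{n(n-1)}{2} \sqrt{1-\gamma}} \right)}{\ln\left( \sqrt{\frac{1}{1-\gamma}} \right)}$.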
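The construction above (find $n$, then $\lambda$, then the $l_i$) translates directly into a short routine. The Python sketch below is an illustration under the assumptions of this proof (constant $A_i = A$, real-valued $l_i \ge 1$); the function name is hypothetical, and the treatment of $D_k \le 0$ as an error is a convention of the sketch.

```python
import math


def inner_iterations_accelerated_linear(k, d_k, gamma):
    """Relaxed minimizer of sum(l_i) s.t. sum_i i * (1 - gamma)^(l_i / 2) <= d_k and l_i >= 1."""
    root = math.sqrt(1.0 - gamma)
    log_inv = math.log(1.0 / root)  # ln sqrt(1 / (1 - gamma)), strictly positive
    if d_k >= k * (k + 1) / 2.0 * root:
        return [1.0] * k  # Case 1: one inner iteration everywhere is feasible
    if d_k <= 0.0:
        raise ValueError("d_k must be positive for the constraint to be feasible")
    # Smallest n in 1..k whose upper threshold n (2k + 1 - n) / 2 * sqrt(1 - gamma) reaches d_k.
    n = next(m for m in range(1, k + 1) if d_k <= m * (2 * k + 1 - m) / 2.0 * root)
    # Multiplier that makes the constraint tight once l_1 = ... = l_{n-1} = 1.
    lam = (k + 1 - n) / (log_inv * (d_k - n * (n - 1) / 2.0 * root))
    return [
        1.0 if i < n else math.log(lam * i * log_inv) / log_inv
        for i in range(1, k + 1)
    ]
```

The first $n - 1$ outer iterations use a single inner iteration, and the later ones grow logarithmically with $i$, so that every term $i (1-\gamma)^{l_i/2}$ with $i \ge n$ takes the same value.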



Back to the global problem. We now seek to find the value $k^*$ that minimizes the global problem.
Outside of the interval defined in Case 1, the global cost is defined by the following. Let us define $n(k)$
as the integer verifying Eq. (30). Then



C glob (k) = kC out + C ln (n(k) 1) + Cm{k n M +1) hi 



In 



M 



k + 1 - n(fc) 



w(fc)(w(fc)-l)